Topics Covered

In this brief presentation, we’ll be introducing the following items:

  • Factor Data Types
  • The forcats library
  • Workflows & Pipes
  • Tables

Categorical Data Types

 

Unique, individual groupings that can be applied to a study design.

  • Case sensitive
  • Can be ordinal
  • Typically defined as character type
weekdays <- c("Monday","Tuesday","Wednesday",
              "Thursday","Friday","Saturday", 
              "Sunday")
class( weekdays )
[1] "character"
weekdays
[1] "Monday"    "Tuesday"   "Wednesday" "Thursday"  "Friday"    "Saturday" 
[7] "Sunday"   
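
A small sketch of the case-sensitivity point, using a made-up vector `mixed` (an assumption for illustration, not part of the weekday data): labels that differ only in capitalization count as different categories.

```r
# Hypothetical vector with inconsistent capitalization
mixed <- c("Monday", "monday", "Monday")

# unique() treats "Monday" and "monday" as two different values
distinct_raw <- unique(mixed)
distinct_raw

# Normalizing the case first collapses them into a single category
distinct_clean <- unique(tolower(mixed))
distinct_clean
```

Cleaning up capitalization before building categories avoids ending up with accidental extra groups.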

Making Up Data 🤷🏻

The function sample() allows us to take a random sample of elements from a vector of potential values.

chooseOne <- sample( c("Heads","Tails"), size=1 )
chooseOne
[1] "Heads"

Making Up More Data 🤷🏻

If we want a larger number of items, we can sample them with or without replacement. Sampling more items than the vector contains only works with replacement.

sample( c("Heads","Tails"), size=10, replace=TRUE )
 [1] "Heads" "Heads" "Tails" "Tails" "Heads" "Tails" "Heads" "Tails" "Tails"
[10] "Tails"
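
Because sample() is random, the output changes on every run. A minimal sketch of making a draw repeatable with set.seed() and of what happens without replacement; the seed value 42 is an arbitrary choice, not something from the slides.

```r
coin <- c("Heads", "Tails")

# Setting the same seed before each draw reproduces the same sample
set.seed(42)
first <- sample(coin, size = 10, replace = TRUE)
set.seed(42)
second <- sample(coin, size = 10, replace = TRUE)
identical(first, second)

# Without replacement we can draw at most length(coin) items (a shuffle)
shuffled <- sample(coin, size = 2, replace = FALSE)

# Asking for more than that is an error, caught here for illustration
too_many <- tryCatch(
  sample(coin, size = 10, replace = FALSE),
  error = function(e) "error"
)
too_many
```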

Weekdays as example

We’ll pretend we have a bunch of data related to the day of the week.

days <- sample( weekdays, size=40, replace=TRUE)
summary( days )
   Length     Class      Mode 
       40 character character 
days
 [1] "Monday"    "Tuesday"   "Sunday"    "Sunday"    "Tuesday"   "Wednesday"
 [7] "Wednesday" "Friday"    "Friday"    "Wednesday" "Wednesday" "Saturday" 
[13] "Wednesday" "Thursday"  "Thursday"  "Tuesday"   "Thursday"  "Sunday"   
[19] "Monday"    "Wednesday" "Thursday"  "Thursday"  "Monday"    "Monday"   
[25] "Friday"    "Friday"    "Monday"    "Sunday"    "Tuesday"   "Thursday" 
[31] "Tuesday"   "Saturday"  "Saturday"  "Wednesday" "Sunday"    "Thursday" 
[37] "Wednesday" "Sunday"    "Wednesday" "Sunday"   

Turn it into a factor

data <- factor( days )
is.factor( data )
[1] TRUE
class( data )
[1] "factor"

Data Type Specific Printing & Summaries

data
 [1] Monday    Tuesday   Sunday    Sunday    Tuesday   Wednesday Wednesday
 [8] Friday    Friday    Wednesday Wednesday Saturday  Wednesday Thursday 
[15] Thursday  Tuesday   Thursday  Sunday    Monday    Wednesday Thursday 
[22] Thursday  Monday    Monday    Friday    Friday    Monday    Sunday   
[29] Tuesday   Thursday  Tuesday   Saturday  Saturday  Wednesday Sunday   
[36] Thursday  Wednesday Sunday    Wednesday Sunday   
Levels: Friday Monday Saturday Sunday Thursday Tuesday Wednesday
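
Summaries are type specific too. A short sketch on a small hypothetical factor `toy` (kept separate from `data` so the counts are easy to check by eye): for a factor, summary() reports counts per level instead of the length/class/mode line shown for the character vector.

```r
# Small made-up factor for illustration
toy <- factor(c("Tuesday", "Monday", "Tuesday", "Friday", "Tuesday"))

# summary() on a factor returns the number of observations per level
toy_counts <- summary(toy)
toy_counts

# table() gives the same counts as a contingency table
table(toy)
```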

Factor Levels

Each factor variable is defined by the levels that constitute the data. This is a finite set of unique values.

levels( data)
[1] "Friday"    "Monday"    "Saturday"  "Sunday"    "Thursday"  "Tuesday"  
[7] "Wednesday"
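
Internally, a factor stores an integer code for each observation that points into the level set. A minimal sketch on a hypothetical two-level factor `toy`:

```r
# Made-up factor; levels default to alphabetical order
toy <- factor(c("Tuesday", "Monday", "Tuesday"))

levels(toy)      # "Monday" "Tuesday"
nlevels(toy)     # 2
as.integer(toy)  # 2 1 2  (position of each value in levels(toy))
```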

Factor Ordination

If a factor is not ordinal, it does not allow the use of relational comparison operators.

data[1] < data[2]
Warning in Ops.factor(data[1], data[2]): '<' not meaningful for factors
[1] NA

Ordination = Ordered

is.ordered( data )
[1] FALSE

Ordination of Factors

Where ordination matters:

  • Fertilizer treatments in kg of N2 per hectare: 10 kg N2, 20 kg N2, 30 kg N2,

  • Days of the Week: Friday is not followed by Monday,

  • Life History Stage: seed, seedling, juvenile, adult, etc.

Where ordination is irrelevant:

  • River

  • State or Region

  • Sample Location

Making Ordered Factors

data <- factor( days, ordered = TRUE)
is.ordered( data )
[1] TRUE
data
 [1] Monday    Tuesday   Sunday    Sunday    Tuesday   Wednesday Wednesday
 [8] Friday    Friday    Wednesday Wednesday Saturday  Wednesday Thursday 
[15] Thursday  Tuesday   Thursday  Sunday    Monday    Wednesday Thursday 
[22] Thursday  Monday    Monday    Friday    Friday    Monday    Sunday   
[29] Tuesday   Thursday  Tuesday   Saturday  Saturday  Wednesday Sunday   
[36] Thursday  Wednesday Sunday    Wednesday Sunday   
7 Levels: Friday < Monday < Saturday < Sunday < Thursday < ... < Wednesday

The problem is that the default ordering is actually alphabetical!

Specifying the Order of Ordinal Factors

data <- factor( days, ordered = TRUE, levels = weekdays)
data
 [1] Monday    Tuesday   Sunday    Sunday    Tuesday   Wednesday Wednesday
 [8] Friday    Friday    Wednesday Wednesday Saturday  Wednesday Thursday 
[15] Thursday  Tuesday   Thursday  Sunday    Monday    Wednesday Thursday 
[22] Thursday  Monday    Monday    Friday    Friday    Monday    Sunday   
[29] Tuesday   Thursday  Tuesday   Saturday  Saturday  Wednesday Sunday   
[36] Thursday  Wednesday Sunday    Wednesday Sunday   
7 Levels: Monday < Tuesday < Wednesday < Thursday < Friday < ... < Sunday
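
Once the levels are ordered, relational comparisons become meaningful. A small sketch on a hypothetical ordered factor `toy`; the helper vector `day_names` just repeats the weekday order for self-containment.

```r
day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")

# Made-up ordered factor using the weekday order as its levels
toy <- factor(c("Friday", "Monday", "Sunday"),
              levels = day_names, ordered = TRUE)

toy[2] < toy[1]                 # Monday comes before Friday: TRUE
on_or_after <- toy >= "Friday"  # comparison against a level name
on_or_after                     # TRUE FALSE TRUE
max(toy)                        # latest day present: Sunday
```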

Sorting Is Now Relevant

sort( data )
 [1] Monday    Monday    Monday    Monday    Monday    Tuesday   Tuesday  
 [8] Tuesday   Tuesday   Tuesday   Wednesday Wednesday Wednesday Wednesday
[15] Wednesday Wednesday Wednesday Wednesday Wednesday Thursday  Thursday 
[22] Thursday  Thursday  Thursday  Thursday  Thursday  Friday    Friday   
[29] Friday    Friday    Saturday  Saturday  Saturday  Sunday    Sunday   
[36] Sunday    Sunday    Sunday    Sunday    Sunday   
7 Levels: Monday < Tuesday < Wednesday < Thursday < Friday < ... < Sunday

Fixed Set of Levels

You cannot assign a value to a factor that is not one of the pre-defined levels.

data[3] <- "Bob"
Warning in `[<-.factor`(`*tmp*`, 3, value = "Bob"): invalid factor level, NA
generated
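
If a new value really does belong in the factor, one option is to extend the level set first. A minimal base R sketch on a hypothetical factor `toy`:

```r
toy <- factor(c("Monday", "Tuesday"))

# Assigning an unknown label yields NA (the warning is silenced here)
bad <- toy
suppressWarnings(bad[1] <- "Bob")
is.na(bad[1])

# Adding "Bob" to the levels first makes the same assignment valid
good <- toy
levels(good) <- c(levels(good), "Bob")
good[1] <- "Bob"
as.character(good)  # "Bob" "Tuesday"
```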

The forcats library

Part of the tidyverse group of packages.

library( tidyverse )
library(forcats)

This library has a lot of helper functions that make working with factors a bit easier. I’m going to give you a few examples here but strongly encourage you to look at the cheat sheet for all the other options.

StarWars API

There is a StarWars API at https://swapi.py4e.com; see ?starwars to learn more about the data it contains (the starwars data set ships with dplyr). Let’s use this data to play with the forcats library.

 

names(starwars)
 [1] "name"       "height"     "mass"       "hair_color" "skin_color"
 [6] "eye_color"  "birth_year" "sex"        "gender"     "homeworld" 
[11] "species"    "films"      "vehicles"   "starships" 

Homeworld as Factor

starwars |>
  filter( !is.na(homeworld), !is.na(mass) ) |>
  mutate( homeworld = factor( homeworld ) ) -> df
df$homeworld
 [1] Tatooine       Tatooine       Naboo          Tatooine       Alderaan      
 [6] Tatooine       Tatooine       Tatooine       Tatooine       Stewjon       
[11] Tatooine       Kashyyyk       Corellia       Rodia          Nal Hutta     
[16] Corellia       Bestine IV     Naboo          Kamino         Trandosha     
[21] Socorro        Bespin         Mon Cala       Endor          Sullust       
[26] Cato Neimoidia Naboo          Naboo          Naboo          Malastare     
[31] Dathomir       Ryloth         Aleen Minor    Vulpter        Tund          
[36] Haruun Kal     Cerea          Glee Anselm    Coruscant      Dorin         
[41] Naboo          Geonosis       Mirial         Mirial         Serenno       
[46] Concord Dawn   Zolan          Ojom           Kamino         Skako         
[51] Shili          Kalee          Kashyyyk       Alderaan       Umbara        
[56] Utapau        
39 Levels: Alderaan Aleen Minor Bespin Bestine IV Cato Neimoidia ... Zolan

Homeworld as Ordered

starwars |>
  filter( !is.na(homeworld), !is.na(mass) ) |>
  mutate( homeworld = factor( homeworld, ordered=TRUE ) ) -> df
df$homeworld
 [1] Tatooine       Tatooine       Naboo          Tatooine       Alderaan      
 [6] Tatooine       Tatooine       Tatooine       Tatooine       Stewjon       
[11] Tatooine       Kashyyyk       Corellia       Rodia          Nal Hutta     
[16] Corellia       Bestine IV     Naboo          Kamino         Trandosha     
[21] Socorro        Bespin         Mon Cala       Endor          Sullust       
[26] Cato Neimoidia Naboo          Naboo          Naboo          Malastare     
[31] Dathomir       Ryloth         Aleen Minor    Vulpter        Tund          
[36] Haruun Kal     Cerea          Glee Anselm    Coruscant      Dorin         
[41] Naboo          Geonosis       Mirial         Mirial         Serenno       
[46] Concord Dawn   Zolan          Ojom           Kamino         Skako         
[51] Shili          Kalee          Kashyyyk       Alderaan       Umbara        
[56] Utapau        
39 Levels: Alderaan < Aleen Minor < Bespin < Bestine IV < ... < Zolan

Visualizing Counts of Levels

df |> 
  ggplot( aes(x=homeworld) ) + 
  geom_bar() + 
  coord_flip()

Occurrence Based Order In data.frame

df |> 
  mutate( homeworld = fct_inorder( homeworld ) ) |> 
  ggplot( aes(x=homeworld) ) + 
  geom_bar() + 
  coord_flip()

Reordering By Another Variable

df |> 
  mutate( homeworld = fct_reorder( homeworld, mass )) |> 
  ggplot( aes(x=homeworld) ) + 
  geom_bar() + 
  coord_flip()
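
A sketch of what fct_reorder() does to the levels, on made-up planets and masses (not the starwars values). By default it orders levels by the median of the second variable within each level; `.fun` and `.desc` change the summary function and the direction.

```r
library(forcats)

# Hypothetical data for illustration only
planet <- factor(c("Naboo", "Tatooine", "Naboo", "Alderaan", "Tatooine"))
mass   <- c(50, 80, 60, 100, 200)

# Default: ascending by median mass (Naboo 55, Alderaan 100, Tatooine 140)
by_median <- levels(fct_reorder(planet, mass))
by_median

# Largest maximum mass first
by_max_desc <- levels(fct_reorder(planet, mass, .fun = max, .desc = TRUE))
by_max_desc
```

In the plot above, geom_bar() still counts rows per homeworld; the reordering only changes the sequence in which the bars appear.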

Frequency Based Ordering

df |> 
  mutate( homeworld = fct_infreq( homeworld ) ) |> 
  ggplot( aes(x=homeworld) ) + 
  geom_bar() + 
  coord_flip()

Reversing Orders

df |> 
  mutate( homeworld = fct_rev( fct_infreq( homeworld ) ) ) |> 
  ggplot( aes(x=homeworld) ) + 
  geom_bar() + 
  coord_flip()
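
The effect of these helpers is easiest to see on the levels themselves. A small sketch with a hypothetical factor `planet`:

```r
library(forcats)

# Made-up factor: Naboo x3, Tatooine x2, Alderaan x1
planet <- factor(c("Naboo", "Tatooine", "Naboo",
                   "Alderaan", "Naboo", "Tatooine"))

lv_default <- levels(planet)                       # alphabetical
lv_inorder <- levels(fct_inorder(planet))          # order of first appearance
lv_infreq  <- levels(fct_infreq(planet))           # most frequent first
lv_rev     <- levels(fct_rev(fct_infreq(planet)))  # least frequent first

lv_default
lv_inorder
lv_infreq
lv_rev
```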

Recoding Factors

New Value = Old Value

starwars |> 
  filter( !is.na(homeworld) ) |>
  mutate( homeworld = fct_recode(homeworld, 
                                 "Homeworld of Zolander" = "Zolan",
                                 "Outer Rim Territory Tund" = "Tund"
  ) ) |>
  ggplot( aes(x=homeworld) ) + 
  geom_bar() + 
  coord_flip()
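
A minimal sketch of the "New Value = Old Value" pattern on a hypothetical three-level factor, showing that only the named level changes and the level positions stay the same:

```r
library(forcats)

# Made-up factor; levels are Naboo, Tund, Zolan
planet <- factor(c("Tund", "Zolan", "Naboo"))

# New label on the left, existing level on the right
renamed <- fct_recode(planet, "Homeworld of Zolander" = "Zolan")
levels(renamed)
as.character(renamed)
```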

Collapsing A Few Selected Levels

starwars |>
  filter( !is.na(homeworld) ) |>
  mutate( homeworld = fct_collapse( homeworld, 
                                    "<---- MEH ---->" = c("Bestine IV","Cerea", "Dorin","Mirial", "Sullust"),
                                    "¯\\_(ツ)_/¯" = c("Umbara","Kashyyyk","Concord Dawn"),
                                    "YES YES YES YES YSE YSE " = c("Nal Hutta","Ojom","Rodia","Ryloth","Serenno","Shili","Skako","Socorro")
  )) |>
  ggplot( aes(x=homeworld) ) + 
  geom_bar() + 
  coord_flip()
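
A smaller sketch of fct_collapse() on a hypothetical factor, where the group label "Minor worlds" is invented for illustration. Each argument names a new level and lists the existing levels it absorbs; levels not mentioned are left alone.

```r
library(forcats)

# Made-up factor; levels are Cerea, Dorin, Mirial, Naboo
planet <- factor(c("Cerea", "Dorin", "Naboo", "Mirial"))

grouped <- fct_collapse(planet,
                        "Minor worlds" = c("Cerea", "Dorin", "Mirial"))
levels(grouped)
as.character(grouped)
```

The names on the right must match existing levels exactly, so a misspelled level name is not collapsed.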

Lumping Based On Counts

starwars |> 
  filter( !is.na(homeworld) ) |>
  mutate( homeworld = fct_lump_min(homeworld, 2)) |>
  ggplot( aes(x=homeworld) ) + 
  geom_bar() + 
  coord_flip()

Lumping into N Groups

starwars |>
  filter( !is.na(homeworld) ) |>
  mutate( homeworld = fct_lump_n(homeworld, n=5)) |>
  ggplot( aes(x=homeworld) ) + 
  geom_bar() + 
  coord_flip()
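
A sketch of the two lumping helpers on a hypothetical factor with known counts. fct_lump_min() keeps levels that appear at least `min` times, while fct_lump_n() keeps the `n` most frequent levels; in both cases the remainder is pooled into "Other".

```r
library(forcats)

# Made-up factor: Naboo x3, Tatooine x2, Alderaan x1, Endor x1
planet <- factor(c("Naboo", "Naboo", "Naboo",
                   "Tatooine", "Tatooine", "Alderaan", "Endor"))

# Levels seen fewer than 2 times are pooled
lumped_min <- fct_lump_min(planet, min = 2)
levels(lumped_min)

# Only the single most frequent level is kept
lumped_n <- fct_lump_n(planet, n = 1)
levels(lumped_n)
```

Note that with n = 5 in the slide above, the result has the five most common homeworlds plus an "Other" bar.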

Counting Factors & Occurrences

df |> 
  group_by( homeworld ) |> 
  summarize( film = length(unique(films) ) ) |> 
  arrange( -film )
# A tibble: 39 × 2
   homeworld    film
   <ord>       <int>
 1 Naboo           6
 2 Tatooine        6
 3 Alderaan        2
 4 Corellia        2
 5 Kamino          2
 6 Kashyyyk        2
 7 Mirial          2
 8 Aleen Minor     1
 9 Bespin          1
10 Bestine IV      1
# ℹ 29 more rows
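
Because films is a list column, length(unique(films)) counts the distinct film lists per homeworld rather than individual film titles. If the number of distinct titles is what we want, one possible variant (a sketch, with the object name `film_counts` and column name `n_films` chosen here for illustration) flattens the lists first:

```r
library(dplyr)

film_counts <- starwars |>
  filter(!is.na(homeworld), !is.na(mass)) |>
  group_by(homeworld) |>
  # unlist() flattens the per-character film lists into one character vector
  summarize(n_films = n_distinct(unlist(films))) |>
  arrange(desc(n_films))

film_counts
```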

Questions

If you have any questions, please feel free to either post them as an “Issue” on your copy of this GitHub Repository, post to the Canvas discussion board for the class, or drop me an email.
